feat(providers): prompt caching for Anthropic + Azure-Anthropic #5101
waleedlatif1 wants to merge 6 commits
Mark the static request prefix (system prompt + tools) with an ephemeral cache_control breakpoint so repeated calls (agent tool-loops and multi-turn) reuse the cached prefix (~90% cheaper cached input + lower latency). Azure-Anthropic inherits this via the shared core.

- New providers/prompt-cache.ts gate: only caches when the static prefix is large enough to be cacheable AND likely reused (tools present, or a large system prompt), so a one-shot tool-less call never pays the cache-write surcharge. Kill switch: PROMPT_CACHE_DISABLED=true.
- anthropic/core.ts: convert system string -> a cached text block (after the structured-output concat, which assumes a string) and tag the last tool. Uses 2 of Anthropic's 4 breakpoints; the tool-loop reuses the tagged payload.
- Outputs are unchanged; cost accounting already reads cache_read/creation tokens (buildAnthropicSegmentTokens), so usage stays accurate.

Matches the AI SDK / LangChain / Spring AI convention (explicit breakpoints for Claude; automatic for OpenAI/Gemini). Bedrock + OpenRouter to follow (they need cache-token accounting alongside).
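To make the cost argument concrete, here is a back-of-the-envelope sketch. It assumes Anthropic's published multipliers for the default 5-minute ephemeral cache (cache writes billed at 1.25x the base input price, cache reads at 0.1x). The function name and call counts are illustrative and do not come from this PR.

```typescript
// Relative cost of sending the same static prefix `calls` times,
// measured in units of "one uncached send of the prefix".
// Assumed multipliers (5-minute ephemeral cache): write 1.25x, read 0.1x.
const CACHE_WRITE_MULTIPLIER = 1.25
const CACHE_READ_MULTIPLIER = 0.1

function relativePrefixCost(calls: number, cached: boolean): number {
  if (calls <= 0) return 0
  if (!cached) return calls
  // The first call writes the cache; each later call within the TTL reads it.
  return CACHE_WRITE_MULTIPLIER + (calls - 1) * CACHE_READ_MULTIPLIER
}

// One-shot call: caching costs more than not caching (1.25 vs 1),
// which is the case the gate is meant to skip.
const oneShotCached = relativePrefixCost(1, true)
const oneShotPlain = relativePrefixCost(1, false)

// Five-iteration tool loop: 1.25 + 4 * 0.1 vs 5 uncached sends.
const loopCached = relativePrefixCost(5, true)
const loopPlain = relativePrefixCost(5, false)

console.log({ oneShotCached, oneShotPlain, loopCached, loopPlain })
```

Under these assumptions the cached path breaks even as soon as the prefix is read back once, which is why the gate keys on "likely reused" rather than on size alone.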
PR Summary
Low Risk
Overview: A new … Unit tests cover the gate, payload mutation edge cases, and end-to-end payload capture on the streaming/no-tools path.
Reviewed by Cursor Bugbot for commit b9a453d.
Greptile Summary
This PR enables Anthropic prompt caching for the Anthropic and Azure-Anthropic providers by stamping …
Confidence Score: 4/5
Safe to merge with awareness that cache token fees are still excluded from the top-level ProviderResponse cost totals. The caching logic itself is correct and well-tested. However, now that caching is always on, every warm-cache call reports an inaccurate cost:
apps/sim/providers/anthropic/core.ts: all three token-accumulation sites (streaming-no-tools path ~line 422, non-streaming initial response ~line 860, and tool-loop iteration ~line 1141) need cache token accounting added to match what …
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[executeAnthropicProviderRequest] --> B[Build payload\nsystem = systemPrompt]
B --> C{responseFormat?}
C -->|prompt-based| D[Append schema to\npayload.system]
C -->|native / none| E[No mutation]
D --> F[applyAnthropicPromptCache\npayload, tools, request.systemPrompt]
E --> F
F --> G{shouldCacheStaticPrefix\ngateSystem, hasTools, toolsApproxChars}
G -->|prefixTokens < 1024\nor no system| H[No-op: return]
G -->|prefixTokens >= 1024\nhasTools or large system| I{payloadSystem\nnon-empty?}
I -->|yes| J[payload.system = TextBlockParam\nwith cache_control: ephemeral]
I -->|no - relocated| K[Skip system block]
J --> L{tools present?}
K --> L
L -->|yes| M[tools lastIndex.cache_control = ephemeral]
L -->|no| N[Done]
M --> N
N --> O[Add thinking config if requested]
O --> P[API call]
…tubEnv

- anthropic/core.ts: gate on request.systemPrompt instead of payload.system, so the no-messages path (where the system text is relocated into a user message and payload.system is blanked) still caches the tools prefix. (Cursor review)
- prompt-cache.test.ts: manage the kill-switch env via vi.stubEnv/unstubAllEnvs instead of assigning undefined (which coerces to "undefined" and leaks across workers). Addresses the Greptile finding while satisfying biome's noDelete rule.
@greptile review
@cursor review
✅ Bugbot reviewed your changes and found no new issues!
Reviewed by Cursor Bugbot for commit 3a44936.
…elper

- Remove the PROMPT_CACHE_DISABLED kill switch; prompt caching is always on.
- Extract the Anthropic tagging into applyAnthropicPromptCache(payload, tools, systemPrompt) in anthropic/utils.ts: one place that gates and mutates the system block + last tool, replacing the two inline blocks in core.ts.
- Add direct unit tests for the helper (system→cached block, last-tool tagged, relocated/blanked-system still tags tools, below-threshold and tool-less cases untouched) so the actual payload mutation is verified, not just the gate.

No behavior change to outputs; verified on vitest 4.1.8 (CI's version).
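The payload mutation described above can be sketched roughly as follows. This is not the PR's code: the local types are simplified stand-ins for the Anthropic SDK types, `tagStaticPrefix` is a hypothetical name, and the gate is reduced to a boolean argument. The only part taken from Anthropic's documented API is the `cache_control: { type: 'ephemeral' }` marker.

```typescript
// Simplified stand-ins for the SDK request types (assumption, for illustration).
type CacheControl = { type: 'ephemeral' }
type TextBlock = { type: 'text'; text: string; cache_control?: CacheControl }
type Tool = { name: string; input_schema: object; cache_control?: CacheControl }
type Payload = { system?: string | TextBlock[]; tools?: Tool[] }

// Hypothetical helper: tags the static prefix in place once the gate says yes.
function tagStaticPrefix(payload: Payload, shouldCache: boolean): void {
  if (!shouldCache) return

  // Convert a non-empty system string into one cached text block.
  // A blank system (text relocated into a user message) is left untouched.
  if (typeof payload.system === 'string' && payload.system.length > 0) {
    payload.system = [
      { type: 'text', text: payload.system, cache_control: { type: 'ephemeral' } },
    ]
  }

  // Tag only the last tool; a breakpoint there covers the whole tools array.
  const tools = payload.tools
  if (tools && tools.length > 0) {
    const last = tools[tools.length - 1]
    if (last) last.cache_control = { type: 'ephemeral' }
  }
}

const demoPayload: Payload = {
  system: 'You are a workflow agent.',
  tools: [
    { name: 'search', input_schema: { type: 'object' } },
    { name: 'fetch', input_schema: { type: 'object' } },
  ],
}
tagStaticPrefix(demoPayload, true)
console.log(JSON.stringify(demoPayload, null, 2))
```

Mutating in place is what lets a tool loop reuse the tagged payload on every iteration without re-running the helper.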
@greptile review
@cursor review
…m and request prompt

Gate on max(final payload.system, request.systemPrompt) so caching fires both when the no-messages path blanks payload.system (size via the request prompt) and when prompt-based structured output appends a large schema to payload.system (size via the final system string). Add a test for the schema-appended case. Caught by Cursor Bugbot.
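One way to picture the sizing rule is the sketch below. The 1024-token floor is the value shown in the review flowchart. The 4-characters-per-token estimate, the `LARGE_SYSTEM_TOKENS` cutoff, and every function name here are assumptions made for illustration, not values taken from the PR.

```typescript
const MIN_CACHEABLE_TOKENS = 1024 // floor shown in the review flowchart
const LARGE_SYSTEM_TOKENS = 4096 // hypothetical "large system prompt" cutoff
const APPROX_CHARS_PER_TOKEN = 4 // rough heuristic, not a real tokenizer

function approxTokens(chars: number): number {
  return Math.ceil(chars / APPROX_CHARS_PER_TOKEN)
}

// Size the gate on whichever system text is larger: the final payload string
// (e.g. payload.system.length, possibly with an appended schema) or the
// original request prompt (still populated when payload.system was blanked).
function gateSystemChars(payloadSystemChars: number, requestPromptChars: number): number {
  return Math.max(payloadSystemChars, requestPromptChars)
}

function shouldCacheSketch(systemChars: number, toolsChars: number, hasTools: boolean): boolean {
  if (systemChars === 0) return false // flowchart: "no system" is a no-op
  const prefixTokens = approxTokens(systemChars + toolsChars)
  if (prefixTokens < MIN_CACHEABLE_TOKENS) return false
  // Likely reused: a tool loop, or a system prompt large enough to justify a write.
  return hasTools || approxTokens(systemChars) >= LARGE_SYSTEM_TOKENS
}

// Relocated case: payload.system is blank, the request prompt is large, tools exist.
const relocatedCase = shouldCacheSketch(gateSystemChars(0, 6000), 2000, true)

// Schema-appended case: a short request prompt grew to 5000 chars in the payload.
const schemaCase = shouldCacheSketch(gateSystemChars(5000, 100), 500, true)

// Small one-shot, tool-less call: below the floor, so skipped.
const smallCase = shouldCacheSketch(gateSystemChars(40, 40), 0, false)

console.log({ relocatedCase, schemaCase, smallCase })
```

Taking the maximum means neither mutation of payload.system (blanking or schema appending) can cause the gate to undercount the prefix.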
@greptile review
@cursor review
✅ Bugbot reviewed your changes and found no new issues!
Reviewed by Cursor Bugbot for commit 38140c7.
Drop the inline // comments in favor of TSDoc on the helper/gate. The gate-sizing and call-ordering rationale now lives in applyAnthropicPromptCache's TSDoc; no behavior change.
@greptile review
@cursor review
✅ Bugbot reviewed your changes and found no new issues!
Reviewed by Cursor Bugbot for commit 5e90631.
Drives the real executeAnthropicProviderRequest down the streaming path with only the client injected via the createClient seam (real models/utils/attachments), and asserts the request payload handed to messages.create carries a cache_control-tagged system block for a large prompt and a plain string for a small one. Closes the end-to-end wiring gap (AI-SDK-style request-body capture).
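The request-body capture pattern mentioned above looks roughly like this. It is a self-contained sketch with no test framework: `executeRequestSketch`, `makeCapturingClient`, the simplified types, the model name, and the 4096-character threshold are all hypothetical stand-ins, not the PR's entry point or gate.

```typescript
type SystemParam =
  | string
  | Array<{ type: 'text'; text: string; cache_control?: { type: 'ephemeral' } }>
type CreateParams = { model: string; system?: SystemParam; stream?: boolean }
interface MessagesClient {
  messages: { create(params: CreateParams): Promise<unknown> }
}

// Hypothetical stand-in for the provider entry point. It takes a client
// factory (the "seam") so a test can inject a fake instead of the real SDK.
async function executeRequestSketch(
  systemPrompt: string,
  createClient: () => MessagesClient
): Promise<void> {
  const params: CreateParams = { model: 'example-model', system: systemPrompt, stream: true }
  // Illustrative threshold only; the real gate is more involved.
  if (systemPrompt.length >= 4096) {
    params.system = [{ type: 'text', text: systemPrompt, cache_control: { type: 'ephemeral' } }]
  }
  await createClient().messages.create(params)
}

// Fake client that records every payload handed to messages.create.
function makeCapturingClient(): { client: MessagesClient; captured: CreateParams[] } {
  const captured: CreateParams[] = []
  const client: MessagesClient = {
    messages: {
      create: async (params) => {
        captured.push(params)
        return {}
      },
    },
  }
  return { client, captured }
}

const { client, captured } = makeCapturingClient()
executeRequestSketch('x'.repeat(5000), () => client).then(() => {
  // Inspect the exact request body the provider would have sent.
  console.log(JSON.stringify(captured[0]?.system).slice(0, 80))
})
```

Asserting on the captured request body, rather than on the helper's return value, is what checks that the tagging survives all the way to the API call.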
@greptile review
@cursor review
✅ Bugbot reviewed your changes and found no new issues!
Reviewed by Cursor Bugbot for commit b9a453d.
Summary
- Mark the static request prefix (system prompt + tools) with an ephemeral cache_control breakpoint for Anthropic (and Azure-Anthropic, which shares the core), so repeated calls (agent tool-loops and multi-turn chats) reuse the cached prefix: ~90% cheaper cached input + lower latency.
- The tagging lives in applyAnthropicPromptCache(payload, tools, systemPrompt) (anthropic/utils.ts), which gates on whether caching is worthwhile and mutates the system block + last tool.

When it caches (the gate)
providers/prompt-cache.ts only applies breakpoints when the static prefix is large enough to be cacheable and likely reused (tools present, or a large system prompt). A one-shot, tool-less call is skipped so it never pays the cache-write surcharge for a prefix that's never read back. The gate is sized on the larger of the final payload.system (which may include an appended structured-output schema) and the original request.systemPrompt (non-empty even when the no-messages path relocates it into a user message).

Why this is safe
- Outputs are unchanged; cost accounting already reads cache_read_input_tokens / cache_creation_input_tokens (buildAnthropicSegmentTokens).

Standard practice
Matches the AI SDK / LangChain / Spring AI / Pydantic AI / LiteLLM convention: explicit cache breakpoints for Claude (Anthropic/Bedrock), automatic server-side caching for OpenAI/Gemini/etc. We auto-place breakpoints on the system+tools prefix (the convergent "SYSTEM_AND_TOOLS" strategy), so users don't hand-mark anything.

Type of Change

Testing
- bun run type-check clean
- Unit tests cover applyAnthropicPromptCache payload mutation across all paths (system→cached block, last-tool tagged, relocated/blanked system, schema-appended system, below-threshold/tool-less no-op), verified on vitest 4.1.8
- bun run lint clean · bun run check:api-validation passed

Follow-ups (not in this PR)
- Bedrock (cachePoint) and OpenRouter (cache_control passthrough for Claude): these need cached-token accounting added alongside (Bedrock doesn't read cacheReadInputTokens/cacheWriteInputTokens), so shipping caching there without it would mis-report cost.
- prompt_cache_key for OpenAI/Azure.

Checklist